-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Audio transcription support #781
Conversation
cfa71d8
to
a4d2383
Compare
6d9af68
to
c055935
Compare
Codecov ReportAttention: Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside the scope of this PR but I'm starting to doubt our pattern of creating Engines that are just thin wrappers over Drivers. Engines should augment 1 or more Drivers for a specific use-case that a single Driver cannot do on its' own.
What do you think?
CC @vasinov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly agree, this layer is way too thin, but I also did not want to create a weird single audio-to-text modality that breaks our established pattern. Optimistically, they're a layer where we can add additional functionality. For now, they're mostly an additional initialization step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, just naming comments
griptape/drivers/audio_transcription/base_audio_transcription_driver.py
Outdated
Show resolved
Hide resolved
|
||
|
||
@define | ||
class BaseAudioTranscriptionDriver(SerializableMixin, ExponentialBackoffMixin, ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we name these Drivers BaseSpeechToTextDriver
for consistency with the inverse Drivers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've steered a bit too far in the direction of naming drivers based on their artifact interfaces. I think the specificity of this name is helpful, what do you think about adding a similarly specific name to the BaseTextToSpeechDriver
? BaseSpeechGenerationDriver
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah thinking about it more I think this current convention is the more "correct" one.
Down to do a rename of BaseTextToSpeechDriver
(though maybe in a separate PR), what do you think about BaseAudioGenerationDriver
? This may extend beyond speech in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
AudioLoader
, responsible for producing anAudioArtifact
from binary audio dataAudioTranscriptionTask
, accepting anAudioArtifact
and responding with aTextArtifact
containing the transcribed textAudioTranscriptionClient
allowing an Agent to generate a transcription from audio in memory or on the filesystemOpenAiAudioTranscriptionDriver
to support integration with OpenAI's speech-to-text modelsOpenAiAudioTranscriptionDriver
orDummyAudioTranscriptionDriver
📚 Documentation preview 📚: https://griptape--781.org.readthedocs.build//781/